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Introduction 

This  report  is  a  summary  of  the  work  and  findings  for  the  prostate  cancer  doctoral 
training  grant  awarded  to  me  through  the  CDMRP  project.  In  addition  to  the 
research  aims  defined  in  the  project  proposal,  there  was  a  substantive  training 
component  to  this  grant  that  will  also  be  reviewed  below.  As  a  student  within  the 
Duke  computational  biology  doctoral  program,  the  focus  of  the  research  had  two 
broad  goals:  the  development  of  methodologies  that  could  elucidate  pathway 
activation  using  cancer  genomic  data,  and  the  direct  application  of  these 
methodologies  to  improve  understanding  of  prostate  cancer  (PC)  progression. 
Both  of  these  goals  have  been  fulfilled,  and  have  culminated  in  the  completion  of 
a  peer-reviewed  publication. 


Body 

In  the  public  abstract  for  my  grant  proposal,  I  described  the  following  broad 
objectives:  “My  research  program  contains  two  goals  in  support  of  my  career 
objectives:  the  development  of  novel  quantitative,  statistical  methods  that  can 
integrate  genomic  data  and  provide  assessments  of  significance  and  uncertainty, 
and  the  utilization  of  these  methods  for  accurate  modeling  of  prostate  tumor 
progression,  yielding  insights  into  the  molecular  composition  and  interactions  on 
which  it  is  based.”  In  this  respect,  the  work  I  have  completed  over  the  course  of 
this  grant  has  succeeded  in  addressing  these  aims.  These  broad  aims  can  be 
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further  decomposed  into  “Training  Goals”  and  “Research  Goals”,  which  are 
described  below  in  greater  detail. 

Training  Goals 

The  main  training  goals  were  to  develop  sufficient  skills  to  develop  state-of-the- 
art  methods  for  mining  of  biological  data,  and  to  become  highly  knowledge  of 
prostate  cancer.  With  respect  to  the  former,  working  with  statisticians  and 
mathematical  biologists,  I  have  gained  a  strong  foundation  in  mathematical 
modeling  and  understand  how  to  conduct  rigorous,  statistically  sound  research. 
This  has  been  complemented  with  several  courses  in  statistics  and  computer 
science. 

I  have  also  worked  closely  with  Dr  Phillip  Febbo,  a  medical  oncologist  and  expert 
in  prostate  cancer.  Through  the  analyses  conducted  here  (and  described  in  the 
Appendix)  I  have  gained  a  strong  understanding  of  prostate  cancer 
tumorigenesis  and  progression,  as  well  as  the  understanding  of  current  clinical 
limitations  in  treatment  of  prostate  cancer. 

Research  Goals 

I  briefly  summarize  the  research  goals  and  findings  here.  For  (Guinney,  Febbo, 
Maggioni,  &  Mukherjee,  201 0)a  more  detailed  description,  I  refer  the  reader  to 
the  peer-reviewed  article  found  in  the  Appendix  entitled  “Multiscale  factor  models 
for  molecular  networks”,  published  in  the  proceedings  for  the  Joint  Statistical 
Meeting  conference  held  in  Vancouver  in  2010  (Guinney,  Febbo,  Maggioni,  & 
Mukherjee,  2010). 

The  main  goals  of  this  research  project  was  to  (a)  develop  a  computational 
method  for  interrogating  molecular  network,  and  (b)  apply  this  method  to  prostate 
cancer  data  to  under  the  mechanisms  of  cancer  progression.  The  method  is 
based  on  the  concept  of  diffusion  wavelets  (Coifman  &  Maggioni,  2006),  which 
provides  a  natural  multi-scale  decomposition  of  a  network  based  on  a  “random 
walk”  model.  Using  a  molecular  network  as  the  starting  point  for  this  analysis  - 
where  edges  designate  known  or  inferred  direct  interactions  between  genes 
and/or  proteins  -  a  random  walk  followed  by  factorization  is  performed  on  the 
networks  to  generate  soft  clusters  on  the  network.  These  "clusters”  form  a 
natural  grouping  that  can  be  further  exploited  in  a  regression  analysis  to  identify 
meaningful  clusters  that  might  be  important  to  a  disease  or  biological  phenotype. 

When  this  method  was  applied  to  a  networks  composed  of  biological  pathways, 
many  clusters  of  pathways  were  implicated  in  prostate  cancer  recurrence.  These 
involved  the  p27  and  SKP2  pathways,  which  are  regulated  by  ubiquitin  mediated 
proteolysis.  Also  identified  was  the  TGF-beta  signaling  pathway,  which  was 
correlated  with  cell  cycle  control.  A  detailed  description  of  these  findings  may  be 
found  in  the  Appendix. 


5 


Key  Research  Accomplishments 

•  Development  of  methodology  for  multi-scale  decomposition  of  molecular 
networks. 

•  Identification  of  p27  and  TGF-beta  as  important  pathways  regulating  cell- 
cycle  control  in  PC  progression. 

Reportable  Outcomes 

•  Completion  and  award  of  PhD  from  Duke  University 

•  Position  as  a  research  scientist  at  a  non-profit  biomedical  institution,  Sage 
Bionetworks 

•  Publication  of  paper  (see  references) 

•  Presentation  of  paper  at  conference:  Joint  Statistical  Meeting  (JSM),  2010 


Conclusion 

The  research  project  has  culminated  in  the  development  of  a  novel 
computational  method  for  analysis  of  high-throughput  genomic  data,  enabling  the 
identification  of  molecular  pathway  or  networks  active  in  a  disease  context.  This 
has  been  applied  to  prostate  cancer  data  and  has  revealed  important 
mechanisms  for  progression  in  PC,  specifically  the  identification  of  the  TGF-beta 
and  p27  pathways  in  mediating  cell-cycle  control. 

The  methodology  relies  on  the  existence  of  an  a-priori  generated  network  for 
multi-scale  analysis.  The  network  used  for  the  prostate  cancer  analysis  was 
based  on  pathway  overlap;  the  drawback  to  this  approach  is  that  it  is  not  disease 
specific  and  does  not  incorporate  any  known  pathway  associations  that  are  more 
likely  to  occur  in  prostate  cancer,  or  in  tumor  models  more  generally.  Therefore, 
this  method  could  be  improved  by  using  pathways  networks  that  more  naturally 
reflect  the  underlying  biology  of  the  disease  under  investigation.  Techniques 
such  as  Bayesian  network  inference  or  graphical  Gaussian  models  could  be 
used  to  derive  such  networks  in  a  data-driven  context. 

The  impact  of  this  method  is  two-fold.  First,  it  recognizes  that  biological  pathways 
are  not  independent  constructs  but  must  instead  be  understood  within  a  larger 
meta-pathway  context.  Second,  the  mechanisms  of  disease  can  be  studied  from 
multiple  levels  of  granularity,  spanning  single  genes  to  pathways  to  meta¬ 
pathways.  Therefore,  a  multi-scale  approach  enables  different  perspectives  at 
different  resolutions  that  will  aid  in  both  interpretability  as  well  as  generalization 
of  the  disease. 
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Multiscale  factor  models  for  molecular  networks 


Justin  Guinney*  Philip  Febbo^  Mauro  Maggioni^  Sayan  Mukherjee  ■* 
Abstract 

A  factor  modeling  framework  is  developed  that  is  both  predictive  of  phenotypic  or  response 
variation  and  the  inferred  factors  offer  insight  with  respect  to  underlying  physical  or  biolog¬ 
ical  processes.  The  method  is  general  and  can  be  applied  to  a  variety  of  scientific  problems. 
We  focus  on  modeling  complex  disease  phenotypes  (etiology  of  cancer)  as  a  motivating  ex¬ 
ample.  In  this  setting,  the  factors  capture  gene  or  protein  interaction  networks  at  different 
scales  -  breadth  of  the  interaction  network.  The  method  integrates  multiscale  analysis  on 
graphs  and  manifolds  developed  in  applied  harmonic  analysis  with  sparse  factor  models, 
a  mainstay  of  applied  statistics.  Specific  findings  include  the  association  of  the  TGF-/3 
pathway  with  prostate  cancer  recurrence  mediated  by  cell-cycle  control  and  the  implication 
of  the  p27  pathway  in  cancer  progression.  In  silico  perturbation  analyses  of  the  inferred 
multiscale  model  suggest  that  the  TGF-/3  pathway  is  a  dominant  pathway  in  control  of 
cell-cycle  deregulation  in  prostate  cancer. 

Key  Words:  diffusion  geometry,  sparse  regression,  molecular  networks,  factor  models 

1.  Introduction 

Methods  for  analyzing  high-dimensional  data  have  received  much  attention  in  the 
last  decade,  driven  by  increasingly  high-throughput  techniques  for  data  generation 
in  the  social,  physical,  and  biological  sciences.  An  important  challenge  in  all  these 
settings  is  the  inference  of  models  that  are  both  predictive  and  interpretable  -  of¬ 
fering  insight  into  the  underlying  biological,  social,  or  physical  phenomena.  This 
can  often  be  restated  as  understanding  the  structure  and  statistical  dependencies 
of  variables  relevant  to  prediction  of  a  response,  category,  or  phenotype  given  high¬ 
dimensional  data.  One  modeling  principle  common  across  biological  and  physical 
sciences  is  the  idea  of  scale  -  the  phenomena  under  study  is  composed  of  inter¬ 
actions  or  processes  that  vary  depending  on  the  level  of  resolution  at  which  they 
are  examined.  The  other  modeling  principle  is  that  a  sparse  or  low-dimensional 
representation  captures  relevant  information  of  the  high-dimensional  data  -  factor 
modeling  and  manifold  learning  are  two  such  examples.  We  couple  these  two  prin¬ 
ciples  in  the  framework  of  multiscale  factor  models  to  produce  a  low-dimensional 
representation  comprised  of  factors  at  various  scales. 

The  scientific  problem  of  modeling  complex  phenotypes  or  traits  serves  as  a  mo¬ 
tivating  example  to  highlight  the  efficacy  of  this  approach.  However,  the  method 
itself  is  general  and  can  be  applied  to  a  variety  of  scientific  and  engineering  applica¬ 
tions.  We  consider  complex  phenotypes  as  those  controlled  by  many  genes  and  gene 
products  with  complex  interactions.  A  common  property  of  complex  phenotypes  is 
heterogeneity  of  both  the  phenotype  and  its  genetic  and  molecular  basis.  Cancer 
is  a  complex  phenotype  where  the  heterogeneity  is  derived  from  two  main  sources: 
variability  across  time  or  stage  of  disease  and  genetic  and  environmental  variability 

*Sage  Bionetworks,  1100  Fairview  Ave  N,  Seattle,  WA  98109 
^Institute  for  Genome  Sciences  and  Policy,  Duke  University,  Durham,  NC  27708 
■'•Department  of  Mathematics,  Duke  University,  Durham,  NC  27708 
^Department  of  Statistical  Science,  Duke  University,  Durham,  NC  27708 


across  individuals.  The  idea  of  scale  is  central  in  oncogenesis  since  the  set  of  steps 
by  which  interactions  of  genetic,  biochemical,  and  cellular  mechanisms  with  envi¬ 
ronmental  factors  driving  tumor  development  vary  in  complexity  of  the  underlying 
networks  as  well  as  the  timescale  of  the  interactions.  In  this  paper  scale  will  refer 
to  the  granularity  or  specificity  of  molecular  and  cellular  interactions,  and  spans 
the  range  from  physical  binding  of  proteins  to  other  molecules,  to  loosely  coupled 
interactions  of  molecular  pathways  and  networks. 

We  address  the  biological  problem  of  mapping  the  genetic  or  expression  variation 
giving  rise  to  phenotypic  variation  onto  sparse  multiscale  subsets  of  a  putative  direct 
gene  (product)  interaction  network.  We  formalize  this  problem  by  denoting  the 
genes  and  gene  products  as  a  set  of  nodes  V  in  a  graph  Q  =  {V,  £}  where  the 
edges  between  gene  products  quantify  the  direct  interaction.  This  graph  can  be 
represented  as  an  association  matrix  W  where  the  elements  Wl3  encode  dependence 
between  two  nodes.  Gene  expression  or  genetic  variation  is  assayed  on  the  nodes  or 
gene  products  in  the  interaction  network.  A  set  of  n  observations  of  the  expression 
measurements  and  phenotypes  is  denoted  as  {(W,  T))}”=1  where  Y  is  the  phenotype 
and  X  €  Mm  are  the  measurements  over  the  m  gene  products  in  the  graph.  The 
multiscale  factor  framework  specifies  the  following  model  for  phenotypic  variation 

p 

Yi  =  Y,Mi{Xi)  +  ei,  (1) 

l=i 

where  e%  charaterizes  the  noise,  <j>i(Xi)  =  (. Xi,(j>i )  is  the  projection  of  the  observa¬ 
tion  X{  onto  the  multiscale  factor  (f>i ,  and  oq  models  the  relevance  of  factor  l  in 
predicting  phenotypic  variation.  The  multiscale  bases  or  factors  </>;  are  constructed 
from  the  association  matrix  W .  The  index  l  has  two  components:  l  =  (j,  k),  where 
j  parametrizes  the  scale  of  the  factor  -  related  to  number  of  nodes  comprising  the 
factor  -  and  k  indexes  factors  at  each  scale  j.  At  the  finest  scale  the  factors  are 
single  genes,  and  at  coarser  scales  ( j  increasing)  they  are  linear  combinations  of 
highly  independent  genes.  At  the  coarsest  scale  these  factors  correspond  to  eigen- 
genes  or  metagenes  used  in  singular  value  decomposition  analysis  [8]  and  sparse 
factor  modeling  [2],  respectively. 

The  main  innovation  of  our  approach  is  that  inferred  factors  can  be  interpreted 
as  subnetworks  at  various  scales  of  molecular  and  cellular  processes  relevant  to 
explaining  phenotypic  variation. 

2.  Sparse  multiscale  factor  models 

The  multiscale  factor  model  (1)  is  specified  as 
p  J  K:i 

Yi  =  ~y  ^  ai4>i(Xi)  +  £i  =  EE  (X^j  T  et ,  i  —  1, ..., n 

1=1  j=l  k= 0 

where  i  indexes  the  observations  and  l  is  a  double  index.  In  the  second  expression 
scale  is  made  explicit  by  decomposing  l  into  a  scale  index  j  and  indexing  factors  at 
scale  j  by  index  k.  Sparsity  implies  that  only  a  few  factors  are  required  to  explain 
variation  in  the  response  and  is  helpful  in  interpreting  the  inferred  model.  We  use 
the  following  generalization  of  the  Lasso  estimator  [1]  to  infer  a  sparse  model 

n  p  p 

d  =  arg  min  'S~]  \yi  -  ai4n(xi)\2  +  A  V]  \cq\  .  (2) 

aEKP  z L z ' 

li=l  1=1  1=1  J 


The  regularization  parameter  A  controls  the  trade-off  between  fitting  the  data  and 
sparsity  of  the  solution  and  is  estimated  using  bootstrapping,  see  Materials  and 
methods.  In  the  classical  Lasso  formulation  the  features  are  the  m  coordinates  of 
the  data  ,  4>i(x)  =  (x,ei),  where  e;  is  the  l- tli  coordinate  basis. 

The  method  of  diffusion  wavelets  is  used  to  construct  the  multiscale  factors, 
{(/>i, ..., . . .  4>p}.  A  putative  interaction  graph  of  the  variables  (e.g.  genes)  is  either 
given  or  inferred,  vertices  correspond  to  variables  and  the  edges  represent  depen¬ 
dence  bewtween  variables.  This  graph  is  represented  by  an  association  matrix  W 
with  element  Wij  encoding  the  dependence  between  the  i-th  and  j-th  variables. 
This  matrix  may  be  given  as  a  priori  knowledge  such  as  a  protein-protein  interac¬ 
tion  network  or  defined  by  local  interactions  in  the  data,  for  example  W a  (xt ,  xj )  = 
exp(— \\xi  —  Xj\\2/a)  with  a  >  0  [15,  12].  Given  the  association  matrix  the  diffusion 
operator  [12]  on  the  graph  is  defined  as 

T  =  D~1/2WD~1/2,  with  Dit  =  Y  wij- 
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This  operator  is  related  to  the  graph  Laplacian  L  [15,  12],  L  =  I  —  T.  In  the  case  of 
a  Gaussian  graphical  model,  T  corresponds  to  the  partial  correlation  matrix  [6,  5]. 
In  manifold  learning,  the  eigenfunctions  of  this  operator  are  used  as  global  basis 
factors  to  capture  information  on  the  geometry  and  local  interactions  underlying 
the  data.  This  can  be  thought  of  as  a  nonlinear  version  of  principal  components 
analysis  (PCA),  the  nonlinearity  is  a  function  of  the  eigenfuctions  of  T  [35].  A 
drawback  of  this  approach  is  that  the  eigenfunctions  tend  to  be  global,  a  linear 
combination  of  all  variables,  and  hence  difficult  to  interpret. 

In  biological  applications  where  interpretability  may  be  as  important  as  predic¬ 
tion  performance  factors  with  a  few  genes  may  be  preferred  even  if  they  have  lower 
predictive  accuracy  than  these  eigenfunctions  or  eigengenes.  In  most  applications 
the  number  of  variables  is  far  greater  than  the  number  of  observations  m  3>  n 
resulting  in  a  very  large  number  of  few  gene  factors  that  are  equally  predictive, 
complicating  the  biological  interpretability  of  factors  with  few  genes.  This  ambi¬ 
guity  arises  from  the  complex  relationships  between  genes  and  we  address  it  by 
constructing  a  multiscale  family  of  factors,  from  single  genes  to  eigengenes.  The 
sparsity  constraint  in  the  regression  model  is  then  used  to  select  predictive  factors. 
Typically  we  obtain  significant  factors  at  different  scales. 

In  order  to  generate  the  different  scales,  we  observe  that  T  only  encodes  local 
relationships  between  the  variables,  the  partial  correlation  matrix  is  sparse  in  the 
terminology  of  graphical  models.  Powers  of  T  integrate  or  propagate  local  depen¬ 
dencies  to  more  global  dependencies.  In  a  graphical  model  T*(z,j)  corresponds  to 
integrating  the  partial  correlation  across  all  paths  of  length  t  from  variable  i  to 
variable  j.  To  recover  the  dependencies  across  all  the  variables  we  sum  across  all 
path-lengths  t 

+oo 

(/  —  T)-1  =  y,  1*  , 

t= l 

which  converges  in  the  complement  of  the  eigenspace  of  T  corresponding  to  the 
eigenvalue  1  .  Similar  expansions  were  used  in  path  analysis  [3,  4]  to  decompose 
genetic  and  phenotypic  variation  and  in  graphical  models  [6,  5]  for  both  computation 
and  inference.  This  expansion  can  be  represented  in  a  truly  multiscale  fashion  as  a 


product  of  multiscale  models  by  the  limit  of  the  following  expansion  (J  — >  +oo) 

2J  /  2J~1  \  j 

=  I  (l  +  T2^)  =  f[{I  +  T2t).  (3) 

t= 1  y  t=i  y  <=o 

The  factors  at  scale  j,  {<Pj,k}k=v  are  constructed  by  analyzing  T 2  and  construct¬ 

ing  a  basis  of  sparse  vectors  spanning  its  range,  which  is  low-dimensional.  T2j 
can  be  represented  by  a  small  matrix  acting  on  the  range  of  T 23  .  This  decompo¬ 

sition  is  the  key  idea  in  diffusion  wavelets,  which  given  the  graph  Q  yields  a  set  of 
multiscale  features  or  bases  {{(/>j,k}k=o}j=v  The  bases  <j>j^  at  different  scales  are 
related  to  each  other:  each  c f>j ^  is  a  linear  combination  of  a  small  number  of  <^-_i  fc’s 
at  the  previous  finer  scale,  yielding  a  hierarchical  structure  among  these  factors.  As 
j  grows  or  the  factors  become  more  global  and  approximate  the  top  eigenvectors  of 
T.  For  more  details  on  diffusion  wavelets  see  [13]. 

Given  p  factors  and  n  observations  where  we  would  like  a  sparse  set  of 

factors  to  capture  variation  in  the  phenotype  in  equation  (1).  The  sparsity  of  the 
model  has  a  strong  dependence  on  the  set  of  factors  used.  A  sparse  representation 
given  one  set  of  factors  may  be  dense  with  respect  to  another  set.  For  example, 
sparsity  with  respect  to  a  few  single  gene  factors  is  very  different  than  sparsity  with 
respect  to  principal  components  or  eigengenes.  The  multiscale  factors  capture  both 
extremes  as  well  as  scales  in  between,  providing  a  rich  set  of  bases. 

The  advantages  of  the  multiscale  representations  with  respect  to  sparsity  and 
interpretability  are  due  to  three  aspects  of  the  method.  A  priori  or  data  adapted 
knowledge  of  dependence  between  the  variables  is  embedded  in  the  bases  by  the 
association  matrix  W.  The  multiscale  bases  interpolate  between  models  comprised 
of  single  genes  and  models  comprised  of  global  factors  models  based  on  the  spectral 
decompositions  of  the  diffusion  operator  T,  (kernel)  PCA  sparse  regression.  There 
is  a  hierarchical  relation  between  scales,  each  basis  at  a  coarser  scale  is  a  sparse 
combination  of  bases  at  a  finer  scale 

2.1  Comparison  with  other  decomposition  and  clustering  methods 

In  this  section  we  develop  the  relation  between  other  matrix  factorization  approaches 
and  the  multiscale  factors  and  relate  the  multiscale  factors  with  hierarchical  clus¬ 
tering. 

An  important  step  in  many  analyses  is  that  of  modeling  or  factorizing  the 
data  matrix  X  £  Mmxn  with  the  goal  of  a  reduced  or  compressed  representation. 
The  most  common  representation  or  factorization  is  principal  component  analy¬ 
sis  (PCA).  The  data  matrix  is  represented  by  the  following  eigendecomposition 
X  =  UXVT ,  where  U  £  Mmxm  and  y  g  Mnxn  are  orthogonal  and  £  £  Mmxn  is 
diagonal  with  entries  a±  >  . . .  an  >  0.  The  best  k  rank  approximation  of  X  consist 
of  the  first  k  columns  of  U  and  V,  Xk  =  When  the  data  matrix  is  low 

rank  this  representation  is  natural.  However  interpretation  of  this  representation  is 
challenging  since  the  columns  of  U  and  V  are  typically  vectors  with  all  entries  non¬ 
zero.  In  this  case  the  factors  are  the  columns  of  U  which  corresponds  to  a  mixture 
of  positive  and  negative  coefficients  for  all  the  genes,  all  the  factors  are  global. 

An  alternative  decomposition  is  a  CUR-type  decomposition  [14]  where  the  data 
matrix  X  takes  the  form  X  ~  CUR ,  where  C  £  Mmxfc,  U  £  R  £  Rfcxn. 

C  and  R  are  subsets  of  k  columns  and  rows  of  the  data  matrix,  respectively.  U, 
notwithstanding  the  name,  is  not  unitary  in  general.  The  interpretation  of  this 


decomposition  is  picking  k  genes  and  data  points  from  the  data  for  the  matrices  C 
an  R  and  the  constructing  U  to  well  approximate  X.  U  is  in  general  a  full  matrix 
and  is  complicated  to  interpret.  A  potential  disadvantage  of  CUR  decompositions 
is  they  may  be  poor  rank  k  approximations  of  the  data  matrix  as  compared  to  PCA. 
This  can  be  obviated  by  careful  and  algorithmically  efficient  choices  of  C ,  U,  R,  as 
shown  in  [14]. 

In  light  of  these  approaches,  we  interpret  the  multiscale  approach  as  “inter¬ 
polating”  between  the  “full”  features  of  PCA  and  the  single  features  of  CUR, 
by  creating  a  multiscale  dictionary  of  features  representing  the  data.  These  fea¬ 
tures  are  nonlinear  functions  on  the  data,  capturing  nonlinear  subspaces  charac¬ 
terizing  the  data.  While  in  this  paper  we  focus  on  the  predictive  aspects  of  the 
model,  our  construction  may  be  used  to  construct  a  multiscale  representation  of 
the  rows  and  the  columns  of  the  data  matrix  X  with  dictionaries  4>rows  =  {^T^8} 
and  $cols  =  {(/>“^s},  respectively.  One  representation  of  the  data  matrix  is  X  = 
4>jA4>J  +  $j_i(A  —  4>jX4>j)Tj_1  +  •  •  • ,  the  data  matrix  is  a  superposition  of 
coarse-to-fine  approximations. 

A  related  task  to  factorizing  the  data  is  hierarchically  grouping  the  data.  Ar¬ 
guably  one  of  the  most  popular  methods  used  for  the  unsupervised  analysis  and 
identification  of  biologically  meaningful  clusters  is  agglomerative  hierarchical  clus¬ 
tering  [7].  A  problem  with  hierarchical  clustering  is  its  sensitivity  to  the  choice  of 
linkage  function  and  a  lack  of  coherent  methodology  to  select  link  functions.  This 
problem  is  extenuated  when  one  agglomerates  clusters  at  larger  scales,  since  the  de¬ 
cision  to  agglomerate  is  based  on  very  basic  cluster  statistics  such  as  correlation  of 
cluster  centers  that  may  not  be  robust.  Another  approach  to  hierarchical  grouping 
is  based  on  recursive  partitioning  using  spectral  methods  [11,  10].  Unfortunately, 
these  “top-down”  approaches  do  not  seem  to  perform  well  on  complex  biological 
networks  which  are  often  characterized  by  slow  spectral  decays.  While  useful  for 
global  partitioning  of  data,  spectral  methods  can  magnify  errors  made  early  on, 
making  it  difficult  to  infer  local  partitions  of  the  data.  The  same  problem  is  in¬ 
herent  in  factor  models  that  use  principal  components  to  generate  factors.  We  will 
show  that  the  mutiscale  factors  provide  robust  hierarchical  clusters. 

The  final  two  aspects  of  the  multiscale  factor  model  that  other  factorization 
or  clustering  approaches  fail  to  address  is  that  the  clusters  or  low-rank  subspaces 
may  not  be  the  most  relevant  with  respect  to  predicting  the  response  or  phenotype. 
We  find  predictive  factors  by  searching  the  redundant  dictionary  of  multiscale  fac¬ 
tors.  In  addition  none  of  the  other  methods  can  impose  a  priori  or  data  adapted 
information  of  the  gene  interaction  graph  Q  in  factorization  or  grouping. 

3.  Results 

An  intuition  of  the  modeling  principles  and  the  flexibility  of  multiscale  factor  models 
is  developed  over  a  series  of  applications.  The  first  two  applications  highlight  the 
information  content  in  the  factors.  The  next  three  applications  focus  on  both  the 
predictive  and  interpretive  aspects  of  the  model  and  illustrate  how  the  model  can 
be  used  to  study  the  effect  of  perturbations  of  a  gene  network. 

3.1  Subphenotype  discovery  with  multiscale  factors 

We  begin  with  a  study  of  hierarchical  clustering  as  applied  to  the  analysis  of  gene 
expression  data.  In  genetically  heterogeneous  diseases  such  as  cancer,  clustering 


Figure  1:  Clustering  on  a  swiss  roll.  [A]  Hierarchical  clustering  (complete  linkage) 
[B]  Clustering  with  diffusion  wavelets. 


methods  have  been  applied  to  genome-wide  expression  data  to  identify  molecularly 
distinct  sub-phenotypes  of  patients  [26,  28,  30].  Such  discoveries  are  of  immense 
interest  as  they  offer  the  potential  for  personalized  medicine  through  targeted  ther¬ 
apies  and  improved  prognosis.  It  is  often  unclear  if  inferred  clusters  are  spurious 
artifacts  of  the  clustering  methodology  or  reflect  the  intrinsic  structure  of  the  data. 
We  explore  this  problem  and  contrast  hierarchical  clustering  with  clustering  using 
diffusion  wavelets. 

Fig.  1  illustrates  a  toy  example  illustrating  the  advantage  of  diffusion  wavelets 
over  agglomerative  hierarchical  clustering.  The  data  consists  of  three  classes  or 
groups  -  a  three  component  mixture  model  -  on  a  spiral  manifold.  These  three 
classes  are  captured  by  diffusion  wavelets,  Fig.  lb,  but  not  by  hierarchical  clustering 
with  complete  linkage,  Fig.  la. 

This  problem  is  also  seen  in  real  data.  In  [31]  292  high-risk  breast  cancer  patients 
were  hierarchically  clustered  into  patient  subgroups  based  on  11  scores  for  each 
patient  computed  from  signatures  of  pathway  deregulation,  a  11  x  292  data  matrix. 
A  few  of  these  clusters  were  able  to  stratify  patients  with  respect  to  recurrence  rates. 
A  natural  question  to  ask  is  how  robust  are  these  clusters  and  do  they  represent 
well-defined  risk  populations?  When  we  apply  hierarchical  clustering  with  complete 
linkage  on  this  same  data  set,  we  are  able  to  recapitulate  the  risk  stratifications 
previously  reported  [31]  (log-rank,  p  <  .001).  However,  clustering  with  average 
linkage  produces  clusters  without  any  significant  difference  in  recurrence  rates.  This 
raises  some  concerns  about  the  stability  of  the  clusters. 

Diffusion  wavelets  can  be  used  to  cluster  this  data,  see  materials  and  methods. 
The  bases  or  factors  {4>j,k}  can  be  used  to  define  hierarchical  clusters  since  each 
factor  assigns  to  each  patient  a  probability  of  belonging  to  the  factor  or  cluster. 
Our  objective  was  to  examine  which  clusters  were  able  to  stratify  patients  by  risk. 
Adding  the  constraint  that  we  were  only  interested  in  moderate  to  large  clusters 
allowed  us  to  ignore  clusters  at  local  scales.  We  conducted  pair-wise  tests  between 
clusters  at  the  same  scale  to  see  if  the  two  clusters  stratified  patients  according  to 
recurrence  rates,  see  Materials  and  Methods. 

Results  from  this  analysis  suggest  the  presence  of  global  risk  patterns.  Three 
significant  cluster  pairs  were  found  at  the  coarser  scales:  one  significant  cluster  pair 
at  the  6th  or  global  scale  (log-rank,  p  =  .006)  and  two  modestly  significant  cluster 
pairs  at  the  5th  scale  (log-rank,  p  =  .06,  p  =  .08).  This  suggests  that  there  may  not 
be  multiple  distinct  molecular  sub-  phenotypes  as  previously  reported. 


3.2  Gene  Ontologies 

The  Gene  Ontology  (GO)  project  [9]  has  defined  a  multiscale  or  hierarchical  orga¬ 
nization  of  function-based  associations  annotating  biological  processes,  molecular 
functions,  and  cellular  localization.  This  has  become  the  gold-standard  for  describ¬ 
ing  functional  relationships  between  sets  of  genes. 

The  Gene  Ontology  (GO)  database  was  used  in  [18]  to  construct  a  metric  or 
distance  functional  between  protein  structure  domains  that  captures  variation  in 
the  function  of  the  protein.  The  authors  in  this  study  used  various  kernel  functions 
K(-)  or  similarity  measures  to  embed  the  gene  ontology  graph  in  a  Euclidean  space. 
Distances  in  this  Euclidean  space  reflect  functional  distances  and  it  was  shown  that 
distance  between  functional  domain  subgraphs  using  these  kernel  embeddings  was 
highly  correlated  with  protein  structure  domains,  using  the  DALI  Z  score  [19].  The 
geodesic  distance,  computed  as  the  shortest  distance  on  the  GO  graph,  was  unable 
to  recover  any  meaningful  structural  correlation.  This  problem  is  related  to  the 
previous  example  in  the  sense  that  geodesic  distance  corresponds  to  hierarchical 
clustering  with  single  linkage.  The  kernel  functions  used  in  [18]  were  local  linear 
embeddings  [16],  graph  diffusions  [12],  and  the  inverse  Laplacian  to  respectively 
capture  local,  medium,  and  global  distances.  However,  the  scale  between  these  dif¬ 
ferent  kernels  is  not  explicit,  making  it  difficult  to  intuit  how  the  kernels  correspond 
to  different  scalings.  Diffusion  wavelets,  on  the  other  hand,  provide  an  explicit  and 
optimally  finite  number  of  scales  to  consider. 

We  repeated  the  analysis  in  [18]  using  diffusion  wavelets  for  generating  kernels, 
where  Kj(-)  =  at  scale  j,  and  found  significant  correlation  between  functional 
distance  (on  the  GO  graph)  and  protein  domain  structure,  Fig.  2.  The  fit  of  an 
exponential  decay  model  to  the  correlation  values  shows  increasing  decay  rates  with 
r  =  {~  0,  2,  3}  corresponding  to  local,  intermediate,  and  global  scales  respectively, 
see  Materials  and  Methods.  The  correspondence  between  increasing  r  and  global 
scales  reflects  the  impact  of  global  topology  in  the  evaluation  of  functional  distances. 
From  Fig.  2,  we  also  observe  greater  noise  at  smaller  functional  distances  for  lo¬ 
cal  scales.  This  illustrates  that  diffusion  at  higher  scales  smooths  or  denoises  the 
evaluation  of  functional  distances. 

3.3  Science  documents  classification 

Classification  of  scientific  documents  is  used  to  highlight  the  accuracy  of  the  multi¬ 
scale  model  and  the  interpretability  of  the  factors.  Given  a  document- word  matrix 
M  composed  of  1153  words  from  1161  articles  gathered  from  Science  News,  define 
Mij  as  the  frequency  of  the  jth  word  in  the  ith  article.  Each  article  is  also  annotated 
according  to  one  of  eight  scientific  subjects  -  Anthropology,  Astronomy,  Social  Sci¬ 
ences,  Earth  Sciences,  Biology,  Mathematics,  Medicine,  and  Physics.  A  preliminary 
multiscale  analysis  of  the  document  graph  using  articles  as  nodes  reveals  a  complex, 
hierarchical  structure  on  the  data,  see  Supp.  materials.  For  the  supervised  analysis, 
we  use  M  to  construct  a  weighted  word  graph  W  with  words  as  nodes  and  edge 
weights  computed  by  pair-wise  word  co-occurence  across  documents. 

The  objective  is  inference  of  a  discriminative  function  capable  of  distinguishing 
Earth  Sciences  documents  (class  A)  from  all  other  documents  (class  B)  using  the 
multiscale  factor  model.  Factors  were  generated  using  diffusion  wavelets  on  the 
weighted  word  graph  W  and  the  factor  weights  were  inferred  based  on  equation  (2). 
The  data  was  randomly  split  into  a  training  set  of  80  documents  from  class  A  and 
B  with  the  remaining  documents  comprising  a  test  set. 


Functional  Distance  -  Diffusion 

Figure  2:  Functional  distance  (measured  on  a  gene  ontology  graph)  versus  struc¬ 
tural  similarity  of  protein  domains  at  different  scales. 


We  compared  the  performance  to  the  multiscale  factor  model  with  the  satu¬ 
rated  model  (Lasso  on  the  original  features),  SVD  regression,  and  regression  using 
the  eigenvectors  of  the  Laplacian.  Bootstrapping  was  used  for  fitting  all  hyper¬ 
parameters  (see  Materials  and  Methods).  The  performance  metric  used  was  the 
area  under  the  receiver  operating  curve  (AUC)  on  the  test  data.  The  experiment 
is  run  20  times  and  results  are  tabulated  (Tbl.  1)  -  the  multiscale  classification 
method  outperforms  the  other  methods  as  measured  by  AUC. 


Method 

Sat 

SVD 

Lap 

MSF 

Avg  AUC 

.886 

.885 

.914 

.934 

Std  dev 

.028 

.038 

.037 

.022 

Table  1:  Performance  comparison  for  discrimination  of  science  documents.  Lasso  is 
applied  to  factors  generated  from  following  methods:  (Sat)  original  features,  (SVD) 
singular  valued  components,  (Lap)  eigenvectors  of  the  Laplacian,  (MSF)  multiscale 
factors  (diffusion  wavelets). 


The  factors  with  significant  factor  loadings  are  also  interpretable.  The  scale  of 
the  factors  positively  correlated  with  the  Earth  Science  category  reflect  levels  of 
specialization  within  the  topic  of  Earth  Science.  The  fine  scale  factors  contain  spe¬ 
cific  words  such  as  ’nitrogen’,  words  such  as  ’pest’,  ’crops’,  and  ’plants’  appear  at  an 
intermediary  scale,  and  words  such  as  ’weather’  and  ’forecast’  appear  at  the  global 
scale.  This  assignment  reflects  the  correspondence  between  scale  and  generality  with 
respect  to  subject  matter.  It  stands  to  reason  that  that  the  highly  weighted  fea¬ 
tures  of  global  scaling  functions  will  be  those  features  with  the  strongest  or  largest 
number  of  dependencies.  In  Earth  Science,  the  word  ’weather’  appropriately  fits 
this  description. 
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Figure  3:  Predicting  Prostate  Cancer  Recurrence,  comparison  of  the  multiscale 
factor  models  with  biomarker  (A,  left)  and  global  factor  models  (B,  right). 

3.4  Predicting  Prostate  Cancer  Recurrence 

The  prostate-specific  antigen  (PSA)  test  is  the  current  standard  for  monitoring  and 
assessing  prostate  cancer  (PC)  risk  in  men.  While  PSA-based  screening  has  likely 
contributed  to  the  decline  of  PC  mortaility  in  the  last  decade,  its  high  false  positive 
rate  has  resulted  in  overtreatment  and  increased  morbidity.  Consequently,  there  is 
much  interest  in  finding  alternative  methods  for  distinguishing  between  aggressive 
and  indolent  forms  of  PC. 

Gene  expression  profiling  has  led  to  several  biomarker  based  models  to  predict 
PC  recurrence  [21,  26,  24,  25,  27].  These  models  are  based  on  assaying  the  expression 
of  a  few  genes.  Unfortunately,  the  models  offer  only  modest  improvement  over 
standard  clinical  models  for  predicting  outcome.  Moreover,  there  is  little  overlap 
in  the  genes  used  in  each  of  these  models.  This  highlights  the  perils  of  gene-based 
predictor  models  for  complex  disease:  technical  variation,  strong  genotypic  and 
phenotypic  heterogeneity,  and  complex  molecular  interactions  limit  the  ability  of 
these  models  to  generalize. 

We  compared  the  multiscale  factor  model  to  these  biomarker  based  models  in 
predicting  PC  recurrence,  see  Fig.  3a.  We  also  compared  our  model  to  other 
regression  methods:  SVD  regression,  factor  regression  using  eigenvectors  of  the 
graph  Laplacian  as  factors,  and  Lasso  on  the  original  features,  see  Fig.  3b.  The  data 
used  was  collected  from  a  large  prostate  cancer  data  bank  composed  of  596  patients 
treated  with  radical  prostatectomies  and  monitored  for  up  to  10  years  [21].  Based 
on  clinical  outcome,  each  patient  is  grouped  into  one  of  three  categories:  PC  with 
no  evidence  of  disease  recurrence  (NED,  n  =  195),  PC  with  biochemical  recurrence 
measured  by  PSA  (PSA,  n  =  201),  and  PC  with  metastatic  recurrence  (SYS,  n  = 
200).  Gene  expression  data  is  available  for  each  patient  measured  on  Illumina’s 
generic  cancer  array  and  a  custom  prostate  cancer  array,  resulting  in  a  total  of  1024 
mRNA  probe  measurements  for  each  patient.  A  dependency  graph  with  each  node 
corresponding  to  a  probe  was  computed  using  a  thresholded  correlation  statistic.  A 
multiscale  factor  model  to  discriminate  PSA  from  SYS  was  inferred.  The  inferred 
model  had  eight  factors  ranging  from  local  to  global.  AUC  on  a  validation  set  was 
used  to  measure  the  predictive  performance  of  the  models.  The  multiscale  factor 
model  outperforms  all  other  methods. 


3.5  PC  Multiscale  Pathway  Analysis 

The  limitation  of  the  predictive  accuracy  of  single  gene  models  for  PC  recurrence 
was  observed  in  the  previous  section.  Inference  and  interpretation  of  molecular 
mechanism  giving  rise  to  PC  recurrence  at  the  level  of  single  genes  suffers  the  same 
problem  of  lack  of  generalization  and  robustness.  This  difficulty  has  motivated 
analysis  based  on  gene  sets ,  which  are  often  more  robust  and  interpretable  than 
analyses  of  single  genes.  Methods  for  gene  set  analysis  [22,  20,  23]  provide  statistical 
evaluation  of  co-expression  of  genes  in  a  priori  defined  sets.  The  sets  capture  prior 
biological  knowledge  such  as  the  KEGG  and  BioCarta  pathways  or  are  based  on 
experiments  using  known  molecular  perturbations. 

We  infer  and  interpret  multiscale  factor  models  in  the  space  of  gene  sets  that 
are  predictive  of  PC  recurrence.  The  nodes  or  variables  in  this  model  will  be  gene 
sets.  Specifically,  we  use  639  curated  genesets  defined  by  the  Molecular  Signa¬ 
ture  Database  [22],  An  association  graph  reflecting  similarity  between  gene  sets 
is  constructed,  see  Materials  and  Methods.  Gene  expression  data  for  197  patients 
was  compiled  from  two  independent  prostate  tumor  banks  (n  =  79,  Memorial; 
n  =  118,  Catalona).  Each  patient  was  given  a  clinical  annotation  of  biochemical 
recurrence  measured  by  PSA  elevation.  The  expression  data  A  is  a  m  x  n  matrix 
with  m  =  22,  000  probes  assayed  and  n  =  197  observations.  This  data  is  mapped 
to  gene  set  enrichment  summary  statistics  Z  which  is  a  g  x  n  matrix  with  g  =  639 
gene  sets.  ZtJ  provides  an  estimate  of  the  constitutive  differential  enrichment  of  the 
genes  in  the  the  i-th  gene  set  in  the  j-th  patient,  see  materials  and  methods  for  de¬ 
tails.  The  multiscale  factor  model  was  applied  using  the  summary  statistics  Z  as  the 
explanatory  variables,  recurrence  (predicted  from  the  previous  multiscale  model)  as 
the  response  categorical  response  variable,  and  an  association  graph  reflecting  gene 
set  similarity,  see  Materials  and  Methods.  See  Supp.  materials  Table  SI  for  details 
of  the  inferred  model,  and  gene  set  membership  probabilities  for  significant  factors. 

We  focus  on  a  few  interesting  observations  resulting  from  the  multiscale  analysis. 
Fig.  4  displays  a  coarse  scale  factor  that  suggests  the  involvement  of  TGF-/3  with 
cell-cycle  mediated  pathways  such  as  the  Gl,  G2,  and  p53  since  these  pathways  are 
up-regulated  in  samples  with  high  predicted  probability  of  PC  recurrence.  Further 
evidence  for  functional  association  with  cell-cycle  is  given  by  the  direct  connection 
between  the  up-regulated  TGF-/3  cell-cycle  gene  sets.  At  a  finer  scale  we  observe 
two  TGF-/3  related  gene  sets  with  enrichment  in  opposite  directions,  see  Fig.  4a  . 
This  may  reflect  differential  activation  of  subpathways  as  TGF-/3  has  been  shown 
to  have  both  repressive  and  proliferative  roles  [29].  Also  of  interest  is  the  presence 
and  up-regulation  of  the  p27  pathway  and  its  gene-  set  neighbors  involved  with 
ubiquitin  mediated  proteolysis.  The  p27  protein  is  a  cell-cycle  inhibitor  and  tumor 
suppressor.  p27  itself  is  known  to  be  controlled  by  post-translational  degradation 
via  ubiquitination,  captured  by  our  model  in  Fig.  4B. 

Using  our  model  we  would  like  to  examine  the  effect  of  a  drug  that  would  disable 
TGF-/3  pathways.  Of  particular  interest  is  whether  compensating  pathways  appear, 
thereby  negating  the  effect  of  a  drug  targeting  TGF-/3.  We  simulate  the  disabling 
of  TGF-/3  by  removing  the  two  TGF-/3  related  pathways  in  the  association  graph 
and  then  applying  the  multiscale  factor  model  on  the  modified  graph.  Results  of 
this  analysis  show  that  no  new  pathways  appear  among  the  selected  factors,  and 
most  of  the  pathways  involved  in  cell-cycle  deregulation  in  Fig.  4  do  not  appear. 
We  do  continue  to  see  the  ubiquitin  proteolysis  pathways  in  Fig.  4B.  This  indicates 
that  the  dependency  between  p27  ubiquitination  and  TGF-/3  is  weak  and  therefore 
likely  unaffected  by  the  disabling  of  TGF-/3  pathways. 
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Figure  4:  TGF-/3  scaling  functions.  The  intensity  of  the  red/blue  nodes  cor¬ 
responds  to  the  amount  of  up/down  regulation  of  the  pathways  between  the  two 
classes.  The  size  of  the  node  corresponds  to  the  weight  of  the  pathway  in  the  scaling 
function.  Edges  denote  connections  from  the  original  dependency  graph.  A.  Local 
TGF-/3  related  scaling  function,  also  observed  as  part  of  a  global  scaling  function 
(entire  figure).  B.  The  only  pathways  selected  after  network  damage,  i.e.  removal 
of  the  TGF-/3  pathways. 


4.  Discussion 

The  multiscale  factor  model  based  on  diffusion  wavelets  is  an  intuitive  and  robust 
framework  for  analyzing  high-dimensional  biological  data  with  complex  dependen¬ 
cies.  The  method  as  presented  is  applicable  to  a  variety  of  scientific  settings  in¬ 
volving  high-dimensional  data.  We  show  that  it  is  particularly  well-suited  for  the 
analysis  of  molecular  networks.  This  addresses  a  major  challenge  in  computational 
biology  -  the  development  of  models  that  are  simultaneously  robust  and  and  allow 
for  the  interpretation  of  underlying  biological  mechanism. 

We  note  that  the  sparsity  penalty  used  in  inference  of  the  factor  loads  in  (2)  is 
scale  agnostic,  there  is  no  scale  bias  in  the  optimization.  We  can  imagine  situations 
where  prior  knowledge  suggests  preference  for  certain  scales  or  alternatively  we 
wish  to  infer  which  scale  or  scales  should  be  given  preference.  The  regularization 
penalty  in  (2)  can  be  modified  to  weigh  scales  differently  with  a  hyper-parameter 
that  controls  the  emphasis  of  factors  as  a  function  of  scale  and  can  be  set  a  priori 
or  adapted  based  on  data. 

Finally,  we  did  not  utilize  direct  protein-protein  or  protein-DNA  information 
from  curated  databases,  such  as  HPRD  [34],  in  the  construction  of  the  interaction 
graphs.  We  believe  direct  binding  information  is  useful  in  well-annotated  model 
systems  such  as  budding  yeast,  but  can  be  misleading  in  highly  context-sensitive 
systems  such  as  human  cancer.  However,  we  do  believe  our  modeling  framework 
can  be  appropriately  applied  to  well  characterized  binding  networks. 


5.  Materials  and  Methods 


5.1  Gene  expression  data 

The  573  breast  cancer  samples  can  be  downloaded  from  the  Gene  Expression  Om¬ 
nibus  [GEO]  microarray  data  repository  at  http:/ / www.ncbi.nhn. nih.gov/geo  under 
the  GEO  accession  numbers  GSE2034,  GSE4922,  GSE3143,  and  GSE7849.  These 
data  sets  were  RMA  normalized  and  corrected  for  batch  effect;  see  [31]  for  details. 
Microarray  data  used  in  the  training  of  the  prostate  cancer  recurrence  model  is 
available  in  GEO  with  accession  number  GSE10645.  The  prostate  data  used  for 
validation  and  pathways  analysis  come  from  two  independent  radical  prostatectomy 
series  from  Washington  University  (Catalona)  and  Memorial  Sloan  Kettering  Cancer 
Center  (Memorial).  These  series  are  not  publicly  available. 

5.2  Local  Similarity  and  Graph  Construction 

We  studied  the  following  distance  metrics  for  defining  local  similarity  across  points 
or  variables  -  Pearson  correlation,  Spearman  rank  correlation,  and  mutual  informa¬ 
tion.  We  did  not  observe  any  significant  differences  in  results  between  these  metrics, 
and  therefore  used  Pearson  correlation  for  the  results  presented  in  this  paper  unless 
otherwise  noted.  To  emphasize  local  geometry  and  to  induce  some  sparseness  in  the 
graphs,  all  graphs  were  thresholded  at  the  60%  percentile. 

5.3  Phenotype  Clustering  in  Breast  Cancer 

The  272  patient  node  graph  is  deduced  from  the  similarity  matrix  W  =  exp(— ||a;j  — 
xj ||2/.5),  where  x  is  an  11-dimensional  vector  corresponding  to  11  genomic  signa¬ 
ture  scores  [31].  Multiple  hard  clusters  are  generated  by  sampling  from  each  <j)j^ 
using  the  corresponding  vector  of  probabilities  (|</>j,fc(xj)|2)"=1.  Survival  differences 
between  cluster  pairs  are  evaluated  using  the  log-rank  statistic.  For  details  concern¬ 
ing  the  breast  cancer  data  set  and  generation  of  signature  enrichment  values,  see 
[31]  and  [33], 

5.4  Comparison  of  Functional  Distance  and  Structural  Similarity  At 
Multiple  Scales 

We  use  the  method  from  [18]  to  compare  correlations  across  scales,  that  is,  we  fit  a 
decaying  exponential  using  the  function  Y  =  yo+Ae~TX .  The  value  of  r  describes  the 
rate  of  decay  where  larger  values  of  r  correspond  to  a  faster  decay  rate.  In  the  con¬ 
text  of  functional  distances,  we  attribute  a  faster  decay  rate  as  incorporating  more 
distant  domains  in  calculating  functional  distances.  We  calculate  functional  dis¬ 
tances  at  scale  j  using  the  diffusion  metric  d?( x,  y)  =  \J K>  (x,  x)  +  K$  (y,  y)  —  2/0  (x,  y) 
where  K  is  the  kernel  corresponding  to  the  diffusion  operator  T.  See  [13]  for  details. 

5.5  Hyper-parameter  fitting 

Selection  of  A  in  (2)  uses  .632  bootstrapping  [32].  This  method  uses  sampling  with 
replacement,  where  each  sample  has  a  1  —  1/n  probability  of  not  being  sampled, 
tending  to  produce  sparse  solutions  and  good  generalizability. 


5.6  Pathway  Analysis  in  Prostate  Cancer 

We  merge  Illumina  and  Asymetrix  data  by  matching  Entrez  Ids,  and  standardize 
each  row  of  genes  to  have  mean  0  and  variance  1. 

The  graph  of  pathways  (gene  sets)  is  deduced  from  the  matrix  Wij  =  —log {pij) 
where  pij  is  the  probability  of  overlap  between  gene  sets  i  and  j  based  on  Fisher’s 
Exact  statistic.  We  use  the  639  genesets  from  MsigDB  [22]  and  ignore  all  genesets 
comprised  of  10  genes  or  less. 

We  need  to  transform  the  data  matrix  Xq,  a  m  x  n  matrix  where  m  is  the  number 
of  probes,  to  Z  -  a  g  X  n  matrix  g  is  the  number  of  gene  sets.  This  is  done  using 
the  ASSESS  [20]  software  package  for  the  transformation  Xq  —>  Z .  which  measures 
gene  set  enrichment  variation  for  each  sample. 
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